This dataset was obtained from Kaggle and contains health and risk factors related to heart disease. It includes parameters such as diabetes, stress level, smoking, age, and gender, among others, which can be used to analyse the risk of heart disease and to study the factors that contribute to its development.
The purpose of collecting this data was to:
Find out the factors that influence the risk of heart disease.
Explore the relationships between various risk factors.
The dataset is the Heart Disease dataset provided by Oktay Rdeki on Kaggle. It compiles health records and survey data from 10,000 patients, collected over a period of five years in multiple hospitals across the United States, and details risk factors and their association with heart disease. It includes self-reported behaviors (e.g., smoking, alcohol consumption) and medically recorded variables (e.g., cholesterol level, blood pressure). It consists of 9 numerical and 12 categorical variables.
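library(tidyverse)
library(plotly) # plot_ly(), subplot(), and layout() are used in the plots below
library(knitr) # kable() is used for the feature table
data <- read.csv("https://raw.githubusercontent.com/Christina-tinaa/Heart-Disease/main/heart_disease.csv")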
str(data)
'data.frame': 10000 obs. of 21 variables:
$ Age : num 56 69 46 32 60 25 78 38 56 75 ...
$ Gender : chr "Male" "Female" "Male" "Female" ...
$ Blood.Pressure : num 153 146 126 122 166 152 121 161 135 144 ...
$ Cholesterol.Level : num 155 286 216 293 242 257 175 187 291 252 ...
$ Exercise.Habits : chr "High" "High" "Low" "High" ...
$ Smoking : chr "Yes" "No" "No" "Yes" ...
$ Family.Heart.Disease: chr "Yes" "Yes" "No" "Yes" ...
$ Diabetes : chr "No" "Yes" "No" "No" ...
$ BMI : num 25 25.2 29.9 24.1 20.5 ...
$ High.Blood.Pressure : chr "Yes" "No" "No" "Yes" ...
$ Low.HDL.Cholesterol : chr "Yes" "Yes" "Yes" "No" ...
$ High.LDL.Cholesterol: chr "No" "No" "Yes" "Yes" ...
$ Alcohol.Consumption : chr "High" "Medium" "Low" "Low" ...
$ Stress.Level : chr "Medium" "High" "Low" "High" ...
$ Sleep.Hours : num 7.63 8.74 4.44 5.25 7.03 ...
$ Sugar.Consumption : chr "Medium" "Medium" "Low" "High" ...
$ Triglyceride.Level : num 342 133 393 293 263 126 107 228 317 199 ...
$ Fasting.Blood.Sugar : num NA 157 92 94 154 91 85 111 103 96 ...
$ CRP.Level : num 12.97 9.36 12.71 12.51 10.38 ...
$ Homocysteine.Level : num 12.39 19.3 11.23 5.96 8.15 ...
$ Heart.Disease.Status: chr "No" "No" "No" "No" ...
summary(data)
Age Gender Blood.Pressure Cholesterol.Level
Min. :18.0 Length:10000 Min. :120.0 Min. :150.0
1st Qu.:34.0 Class :character 1st Qu.:134.0 1st Qu.:187.0
Median :49.0 Mode :character Median :150.0 Median :226.0
Mean :49.3 Mean :149.8 Mean :225.4
3rd Qu.:65.0 3rd Qu.:165.0 3rd Qu.:263.0
Max. :80.0 Max. :180.0 Max. :300.0
NA's :29 NA's :19 NA's :30
Exercise.Habits Smoking Family.Heart.Disease Diabetes
Length:10000 Length:10000 Length:10000 Length:10000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
BMI High.Blood.Pressure Low.HDL.Cholesterol High.LDL.Cholesterol
Min. :18.00 Length:10000 Length:10000 Length:10000
1st Qu.:23.66 Class :character Class :character Class :character
Median :29.08 Mode :character Mode :character Mode :character
Mean :29.08
3rd Qu.:34.52
Max. :40.00
NA's :22
Alcohol.Consumption Stress.Level Sleep.Hours Sugar.Consumption
Length:10000 Length:10000 Min. : 4.001 Length:10000
Class :character Class :character 1st Qu.: 5.450 Class :character
Mode :character Mode :character Median : 7.003 Mode :character
Mean : 6.991
3rd Qu.: 8.532
Max. :10.000
NA's :25
Triglyceride.Level Fasting.Blood.Sugar CRP.Level Homocysteine.Level
Min. :100.0 Min. : 80.0 Min. : 0.003647 Min. : 5.000
1st Qu.:176.0 1st Qu.: 99.0 1st Qu.: 3.674126 1st Qu.: 8.723
Median :250.0 Median :120.0 Median : 7.472164 Median :12.409
Mean :250.7 Mean :120.1 Mean : 7.472201 Mean :12.456
3rd Qu.:326.0 3rd Qu.:141.0 3rd Qu.:11.255592 3rd Qu.:16.141
Max. :400.0 Max. :160.0 Max. :14.997087 Max. :19.999
NA's :26 NA's :22 NA's :26 NA's :20
Heart.Disease.Status
Length:10000
Class :character
Mode :character
# Define the feature variables with their descriptions and data types
features <- data.frame(
  Feature_Name = c("Age", "Gender", "Blood Pressure", "Cholesterol Level", "Exercise Habits",
                   "Smoking", "Family Heart Disease", "Diabetes", "BMI", "High Blood Pressure",
                   "Low HDL Cholesterol", "High LDL Cholesterol", "Alcohol Consumption",
                   "Stress Level", "Sleep Hours", "Sugar Consumption", "Triglyceride Level",
                   "Fasting Blood Sugar", "CRP Level", "Homocysteine Level", "Heart Disease Status"),
  Description = c("The individual's age", "The individual's gender (Male or Female)",
                  "The individual's blood pressure (systolic)", "The individual's total cholesterol level",
                  "The individual's exercise habits (Low, Medium, High)", "Whether the individual smokes or not (Yes or No)",
                  "Whether there is a family history of heart disease (Yes or No)", "Whether the individual has diabetes (Yes or No)",
                  "The individual's body mass index", "Whether the individual has high blood pressure (Yes or No)",
                  "Whether the individual has low HDL cholesterol (Yes or No)", "Whether the individual has high LDL cholesterol (Yes or No)",
                  "The individual's alcohol consumption level (None, Low, Medium, High)",
                  "The individual's stress level (Low, Medium, High)", "The number of hours the individual sleeps",
                  "The individual's sugar consumption level (Low, Medium, High)", "The individual's triglyceride level",
                  "The individual's fasting blood sugar level", "The C-reactive protein level (a marker of inflammation)",
                  "The individual's homocysteine level (an amino acid that affects blood vessel health)",
                  "The individual's heart disease status (Yes or No)"),
  Data_Type = c("Numerical", "Categorical", "Numerical", "Numerical", "Categorical",
                "Categorical", "Categorical", "Categorical", "Numerical", "Categorical",
                "Categorical", "Categorical", "Categorical",
                "Categorical", "Numerical", "Categorical", "Numerical",
                "Numerical", "Numerical", "Numerical", "Categorical")
# Print the table
kable(features)
| Feature_Name | Description | Data_Type |
|---|---|---|
| Age | The individual’s age | Numerical |
| Gender | The individual’s gender (Male or Female) | Categorical |
| Blood Pressure | The individual’s blood pressure (systolic) | Numerical |
| Cholesterol Level | The individual’s total cholesterol level | Numerical |
| Exercise Habits | The individual’s exercise habits (Low, Medium, High) | Categorical |
| Smoking | Whether the individual smokes or not (Yes or No) | Categorical |
| Family Heart Disease | Whether there is a family history of heart disease (Yes or No) | Categorical |
| Diabetes | Whether the individual has diabetes (Yes or No) | Categorical |
| BMI | The individual’s body mass index | Numerical |
| High Blood Pressure | Whether the individual has high blood pressure (Yes or No) | Categorical |
| Low HDL Cholesterol | Whether the individual has low HDL cholesterol (Yes or No) | Categorical |
| High LDL Cholesterol | Whether the individual has high LDL cholesterol (Yes or No) | Categorical |
| Alcohol Consumption | The individual’s alcohol consumption level (None, Low, Medium, High) | Categorical |
| Stress Level | The individual’s stress level (Low, Medium, High) | Categorical |
| Sleep Hours | The number of hours the individual sleeps | Numerical |
| Sugar Consumption | The individual’s sugar consumption level (Low, Medium, High) | Categorical |
| Triglyceride Level | The individual’s triglyceride level | Numerical |
| Fasting Blood Sugar | The individual’s fasting blood sugar level | Numerical |
| CRP Level | The C-reactive protein level (a marker of inflammation) | Numerical |
| Homocysteine Level | The individual’s homocysteine level (an amino acid that affects blood vessel health) | Numerical |
| Heart Disease Status | The individual’s heart disease status (Yes or No) | Categorical |
This dataset will be used to:
Identify significant risk factors associated with heart disease.
Examine interactions between lifestyle, biochemical markers, and cardiovascular health.
Develop predictive models for heart disease risk assessment.
Several features have missing values, particularly the health-related metrics. Missing data can reduce modeling accuracy and must be addressed, for example through imputation.
Preparing for analysis requires handling these missing values. The summary above shows missing values (NA) in all nine numerical variables: Age (29), Blood.Pressure (19), Cholesterol.Level (30), BMI (22), Sleep.Hours (25), Triglyceride.Level (26), Fasting.Blood.Sugar (22), CRP.Level (26), and Homocysteine.Level (20). To handle observations with missing data, imputation will be performed prior to analysis.
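One simple way to carry out such an imputation is to replace each missing numerical value with the median of its column. The sketch below is only an illustration on a small made-up data frame: the helper name `impute_median` and the toy values are assumptions, and the report does not state which imputation method is actually used.

```r
# Hypothetical helper: replace NA in every numeric column with that column's median.
impute_median <- function(df) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  for (col in num_cols) {
    med <- median(df[[col]], na.rm = TRUE)
    df[[col]][is.na(df[[col]])] <- med
  }
  df
}

# Toy data using column names from the heart disease data (values are made up).
toy <- data.frame(
  Age = c(56, NA, 46, 32),
  BMI = c(25.0, 25.2, NA, 24.1),
  Smoking = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

toy_imputed <- impute_median(toy)
colSums(is.na(toy_imputed))
```

Applied to the full data this would read `data <- impute_median(data)`. Median imputation is robust to skewed columns, but it narrows the spread of each variable, so it is best treated as a baseline rather than a final choice.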
# PLOTTING MISSING VALUES
# Generating data frame of missing values per variable
MissingVal <- data.frame(
  Variables = names(data),
  Missing = colSums(is.na(data))
)
# Generating interactive plot using plotly
Plot_MissingVal <-
  # Taking a subset of MissingVal, so only entries with > 0 missing values will be displayed
  subset(MissingVal, Missing > 0) %>%
  # Passing the subset to plot_ly
  plot_ly(
    x = ~Variables,
    y = ~Missing,
    type = "bar" # Stating the trace type explicitly instead of letting plotly infer it
  ) %>%
  layout(
    title = list(
      text = "Missing Values per Variable"
    ),
    xaxis = list(
      title = "Variables with Missing Values",
      categoryorder = "trace"
    ),
    yaxis = list(
      title = "Number of Missing Values"
    )
  )
# Outputting plot
Plot_MissingVal
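The plot above counts only values that R stores as `NA`. With its default settings, `read.csv()` turns blank fields into `NA` in numeric columns but keeps them as empty strings in character columns, so the categorical variables could contain blanks that `is.na()` does not see. Whether this dataset has such blanks is not visible in the output shown here; the sketch below, using the hypothetical helper `count_missing` and made-up toy data, is one way to check.

```r
# Hypothetical helper: count NA values plus empty or whitespace-only strings per column.
count_missing <- function(df) {
  sapply(df, function(x) {
    blank <- if (is.character(x)) !is.na(x) & trimws(x) == "" else rep(FALSE, length(x))
    sum(is.na(x) | blank)
  })
}

# Toy data (values are made up): one NA number, one blank string, one NA string.
toy <- data.frame(
  Age = c(56, NA, 46),
  Smoking = c("Yes", "", "No"),
  Diabetes = c("No", "Yes", NA),
  stringsAsFactors = FALSE
)

count_missing(toy)
```

If `count_missing(data)` reported more missing entries than `colSums(is.na(data))` for any categorical column, re-reading the file with `na.strings = c("NA", "")` would make those blanks show up as `NA`.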
Of particular note is the wide variability in Triglyceride Level and Fasting Blood Sugar. Typical triglyceride levels are classified as follows: normal (<150 mg/dL), borderline high (150–199 mg/dL), high (200–499 mg/dL), and very high (≥500 mg/dL). For fasting blood sugar, normal levels range between 70 and 99 mg/dL, while levels of 126 mg/dL or higher are indicative of diabetes.
In this dataset, Triglyceride Level ranges from 100 to 400 mg/dL with a median of 250 mg/dL, so at least half of the recorded values fall in the high category, although none reaches the very high category. Fasting Blood Sugar ranges from 80 to 160 mg/dL with a third quartile of 141 mg/dL, so at least a quarter of the values lie in the diabetic range. Such a large share of elevated values may reflect individuals with underlying conditions (e.g., hypertriglyceridemia or poorly managed diabetes). If these values do not align with medical plausibility, they could also represent measurement errors or data entry issues.
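To see how many observations fall into each clinical band, the thresholds quoted above can be turned into categories with `cut()`. This is a minimal sketch on made-up values; the helper names and the label "Intermediate" for the band between the normal and diabetic ranges are assumptions made for illustration.

```r
# Hypothetical helpers: bin values using the thresholds quoted in the text.
classify_triglycerides <- function(x) {
  cut(x,
      breaks = c(-Inf, 150, 200, 500, Inf),
      labels = c("Normal", "Borderline high", "High", "Very high"),
      right = FALSE) # intervals are [lower, upper)
}

classify_fasting_sugar <- function(x) {
  cut(x,
      breaks = c(-Inf, 100, 126, Inf),
      labels = c("Normal", "Intermediate", "Diabetes range"),
      right = FALSE)
}

# Made-up example values, each vector including one NA
trig <- c(120, 150, 342, 400, NA)
fbs <- c(85, 100, 125, 157, NA)

table(classify_triglycerides(trig), useNA = "ifany")
table(classify_fasting_sugar(fbs), useNA = "ifany")
```

On the real data the same helpers would be called as `classify_triglycerides(data$Triglyceride.Level)` and `classify_fasting_sugar(data$Fasting.Blood.Sugar)`; missing values stay `NA` rather than being assigned to a band.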
# ====================
# PLOTTING ALL NUMERICAL NON-BINARY VARIABLES
# ====================
# Selecting only numeric variables
Numeric_Var <- select(data, where(is.numeric))
# Eliminating any binary variables (columns that contain only 0, 1, or NA)
Numeric_Var <- Numeric_Var[!sapply(Numeric_Var, function(x) all(x %in% c(0, 1, NA)))]
# Preparing a list of subplots
Numeric_Fig <- list()
# Using a for loop to generate a subplot per variable in Numeric_Var
for (i in seq_along(Numeric_Var)) {
  Numeric_Fig[[i]] <- plot_ly(
    x = Numeric_Var[[i]],
    y = "",
    type = "box",
    name = colnames(Numeric_Var)[i]
  )
}
# Generating a plot with one subplot per variable in Numeric_Var (9 here) across 5 rows;
# passing the whole list avoids silently dropping the last variable
Plot_Numeric_Var <-
  subplot(Numeric_Fig, nrows = 5, margin = 0.05) %>%
  layout(
    title = "Distributions of All Numerical Non-binary Variables",
    legend = list(
      title = list(text = "<b> Variable </b>"),
      bgcolor = "yellow",
      bordercolor = "orange",
      borderwidth = 2
    )
  )
# Outputting plot
Plot_Numeric_Var
Cholesterol Level and Triglyceride Level
There appears to be no strong visible linear trend between cholesterol and triglyceride levels. Both variables are widely distributed across their full ranges.
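The visual impression of no linear trend can be backed by a number. A minimal sketch, using made-up vectors in place of the real columns, computes the Pearson correlation coefficient while skipping incomplete pairs. A value near 0 would agree with the scatterplot in this section, though the actual value for this dataset is not reported here.

```r
# Made-up values standing in for data$Cholesterol.Level and data$Triglyceride.Level
chol <- c(155, 286, 216, 293, 242, NA, 175)
trig <- c(342, 133, 393, 293, 263, 126, NA)

# Pearson correlation using only the pairs where both values are present
r <- cor(chol, trig, use = "complete.obs")
r
```

On the real data the call would be `cor(data$Cholesterol.Level, data$Triglyceride.Level, use = "complete.obs")`.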
# Generating plot
CT <-
  plot_ly(
    data = data,
    x = ~Cholesterol.Level,
    y = ~Triglyceride.Level,
    type = "scatter", # Stating the trace type explicitly instead of letting plotly infer it
    mode = "markers"
  ) %>%
  layout(
    title = "Cholesterol Level and Triglyceride Level",
    xaxis = list(title = "Cholesterol Level"),
    yaxis = list(title = "Triglyceride Level")
  )
# Outputting plot
CT
Gender and Exercise Habits
Both males and females show similar patterns in exercise habits, with the highest count in “high exercise” and similar distributions across “medium” and “low” categories.
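Because the two genders need not be equally represented, within-gender shares are easier to compare than raw counts. The sketch below shows the idea with `table()` and `prop.table()` on a made-up data frame that reuses the column names of the dataset; the values are purely illustrative.

```r
# Toy data (values are made up) with the same column names as the dataset
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  Exercise.Habits = c("High", "Low", "High", "High", "Medium", "Low", "High"),
  stringsAsFactors = FALSE
)

# Cross-tabulate counts, then convert each row (gender) to shares that sum to 1
counts <- table(toy$Gender, toy$Exercise.Habits)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

If the rows of `shares` computed on the real data are close to each other, that would support the statement that exercise habits are distributed similarly for both genders.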
# Preparing plot data
Gender_Exercise <-
  data %>%
  group_by(Gender, Exercise.Habits) %>%
  summarise(Count = n(), .groups = "drop")
# Generating plot (one bar trace per exercise level)
GE <- plot_ly()
GE <- GE %>%
  add_trace(
    data = subset(Gender_Exercise, Exercise.Habits == "Low"),
    x = ~Gender,
    y = ~Count,
    type = "bar",
    name = "Low"
  ) %>%
  add_trace(
    data = subset(Gender_Exercise, Exercise.Habits == "Medium"),
    x = ~Gender,
    y = ~Count,
    type = "bar",
    name = "Medium"
  ) %>%
  add_trace(
    data = subset(Gender_Exercise, Exercise.Habits == "High"),
    x = ~Gender,
    y = ~Count,
    type = "bar",
    name = "High"
  ) %>%
  layout(
    title = "Gender and Exercise Habits",
    xaxis = list(title = "Gender"),
    yaxis = list(title = "Count"),
    legend = list(
      title = list(text = "<b> Exercise Habits </b>"),
      bgcolor = "#E2E2E2",
      bordercolor = "#FFFFFF",
      borderwidth = 2
    )
  )
# Outputting plot
GE
Alcohol Consumption and Heart Disease Status
Most individuals, whether they report low, medium, or no alcohol consumption, do not have heart disease. The number of individuals with heart disease is nonetheless similar across these categories.
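Counts alone can hide differences when the alcohol categories contain different numbers of people, so it may help to look at the share of individuals with heart disease in each category. The following is a minimal sketch on a made-up data frame with the same column names; none of the values come from the dataset.

```r
# Toy data (values are made up) with the same column names as the dataset
toy <- data.frame(
  Alcohol.Consumption = c("Low", "Low", "Medium", "Medium", "High", "High", "High", "Low"),
  Heart.Disease.Status = c("No", "Yes", "No", "No", "Yes", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Share of individuals with heart disease within each alcohol consumption category
rate <- tapply(toy$Heart.Disease.Status == "Yes", toy$Alcohol.Consumption, mean)
rate
```

Similar rates across categories on the real data would indicate that alcohol consumption, taken alone, does little to separate individuals with and without heart disease.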
# Preparing plot data
Alcohol_HeartDisease <-
  data %>%
  group_by(Alcohol.Consumption, Heart.Disease.Status) %>%
  summarise(Count = n(), .groups = "drop")
# Generating plot
AH <- plot_ly() %>%
  add_trace(
    data = Alcohol_HeartDisease,
    x = ~Alcohol.Consumption,
    y = ~Count,
    type = "bar",
    color = ~Heart.Disease.Status
  ) %>%
  layout(
    title = "Alcohol Consumption and Heart Disease Status",
    xaxis = list(title = "Alcohol Consumption"),
    yaxis = list(title = "Count"),
    legend = list(
      title = list(text = "<b> Heart Disease Status </b>"),
      bgcolor = "#E2E2E2",
      bordercolor = "#FFFFFF",
      borderwidth = 2
    )
  )
# Outputting plot
AH
Cholesterol Level, Triglyceride Level, and BMI
The 3D scatterplot shows no clear clustering or pattern among these three variables. BMI, cholesterol, and triglycerides are spread widely across all ranges.
# Generating plot
CTB <- plot_ly() %>%
  add_trace(
    data = data,
    x = ~Cholesterol.Level,
    y = ~Triglyceride.Level,
    z = ~BMI,
    type = "scatter3d", # Stating the trace type explicitly instead of letting plotly infer it
    mode = "markers",
    marker = list(size = 2),
    hovertemplate = paste(
      "<b>Cholesterol Level</b>: %{x}<br>",
      "<b>Triglyceride Level</b>: %{y}<br>",
      "<b>BMI</b>: %{z}"
    ),
    name = ""
  ) %>%
  layout(
    title = "Cholesterol Level, Triglyceride Level, and BMI",
    scene = list(
      xaxis = list(title = "Cholesterol Level"),
      yaxis = list(title = "Triglyceride Level"),
      zaxis = list(title = "BMI"),
      aspectmode = "cube"
    )
  )
# Outputting plot
CTB